This vignette demonstrates the multi-group integration framework of MOFA+ on a single data modality.
We consider a data set of scRNA-seq experiments where 16,152 cells were isolated from a total of 8 mouse embryos from developmental stages E6.5, E6.75, E7.0 and E7.25 (two embryos per stage), spanning post-implantation and early gastrulation.
Despite differences in developmental time, embryos are expected to contain similar subpopulations of cells. Hence, MOFA+ should detect the existence of biological sources of variation that are shared between groups.
The data set we use here is a simplified subset version of the original data set, which can be visualised and downloaded from here.
Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages
Define cell type colors for the visualisations
colors <- c(
"Epiblast" = "grey70",
"Primitive Streak" = "sandybrown",
"Mesoderm" = "violetred",
"ExE endoderm" = "#548B54",
"ExE ectoderm" = "black"
)
The RNA expression data has been processed using Seurat. It has already been normalised and subset to the top 5,000 most variable genes (after regressing out the group effect).
load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scrna_gastrulation/gastrulation10x_seurat.RData"))
Group cells according to the embryo and the stage they comne from
seurat@meta.data$stage_sample <- paste(seurat@meta.data$stage,seurat@meta.data$sample, sep="_")
unique(seurat@meta.data$stage_sample)
## [1] "E6.5_1" "E6.5_5" "E7.0_10" "E7.0_14" "E7.25_26" "E7.25_27"
MOFAobject <- create_mofa(seurat, groups = "stage_sample")
plot_data_overview(MOFAobject)
Data options: let’s use default
data_opts <- get_default_data_options(MOFAobject)
Model options: let’s use default
model_opts <- get_default_model_options(MOFAobject)
Training options
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "fast"
train_opts$seed <- 42
MOFAobject <- prepare_mofa(
object = MOFAobject,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts
)
This step can take quite some time (~2h with standard CPU inference), so we provide a pre-trained model in the next chunk of code.
Note: if you train the model from scratch, the results will not be 100% reproducible with the pre-trained model (but hopefully similar enough!).
outfile = paste0(getwd(),"/model.hdf5")
model <- run_mofa(MOFAobject, outfile = outfile)
MOFA models are saved in hdf5 format and can be loaded into R with the function load_model. In this case, however, we provide the trained model as an RData file
# model <- load_model(outfile)
# load("/Users/ricard/data/mofa2_vignettes/gastrulation10x_mofa.RData")
load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scrna_gastrulation/gastrulation10x_mofa.RData"))
The MOFAobject consists of multiple slots where relevant data and information is stored. For descriptions, you can read the documentation by ?MOFA
slotNames(model)
## [1] "data" "intercepts" "imputed_data"
## [4] "samples_metadata" "features_metadata" "expectations"
## [7] "training_stats" "training_options" "stochastic_options"
## [10] "data_options" "model_options" "dimensions"
## [13] "on_disk" "dim_red" "cache"
## [16] "status"
head(model@samples_metadata)
## sample stage lineage group
## 1: cell_1 E6.5 Epiblast E6.5 (1)
## 2: cell_2 E6.5 Primitive Streak E6.5 (1)
## 3: cell_5 E6.5 ExE ectoderm E6.5 (1)
## 4: cell_6 E6.5 Epiblast E6.5 (1)
## 5: cell_8 E6.5 Epiblast E6.5 (1)
## 6: cell_9 E6.5 Epiblast E6.5 (1)
The function plot_data_overview can be used to obtain an overview of the input data.
It shows how many views (rows) and how many groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).
In this case we have one view (RNA expression, a total of 5,000 genes) and 6 groups that correspond to different embryos at different stages of development, for a total of 16,152 cells.
plot_data_overview(model, colors = c("RNA"="darkgreen"))
Quantifying the variance explained per factor across groups and views is probably the most important plot that MOFA+ generates. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.
plot_variance_explained(model, x="group", y="factor")
There is a lot of information contained in this plot. Factor 1 and Factor 2 explain a lot of variance across multiple groups. In contrast, Factor 4 increases its activity from E6.5 to E7.5, indicating that it captures a source of variation that emerges at E6.5.
To make the plot less dominated by the top 2 factors, one can scale the R2 values with the arguments min_r2 and max_r2:
plot_variance_explained(model, x="group", y="factor", max_r2 = 0.15)
We can also plot the total variance explained per group (with all factors) by adding the argument plot_total = TRUE. Notably, only 10 factors are sufficient to capture between 35% and 55% of the transcriptional variance per embryo
p <- plot_variance_explained(model, x="group", y="factor", plot_total = T)
p[[2]]
We can also inspect the variance explained by the MOFA factors for individual features. This is useful to disentangle how much of the variation of a gene is “noise” (high variance but low \(R^2\)) versus coordinated variance (high \(R^2\))
features <- c("Rbp4","Ttr","Spink1","Mesp1","E130311K13Rik","Hey1")
Variance explained by all factors across all groups
plot_variance_explained_per_feature(
model,
factors = "all",
groups = "all",
view = "RNA",
features = features
)
Each factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs indicate opposite phenotypes, with higher absolute value indicating a stronger phenotype. For example, if the \(k\)-th factor captures the variability associated with cell cycle, we could expect cells in Mitosis to be at one end of the factor (irrespective of the sign, only the relative positioning being of importance). In contrast, cells in G1 phase are expected to be at the other end of the factor. Cells with intermediate phenotype, or with no clear phenotype (i.e. no cell cycle genes profiled), are expected to be located around zero.
Let’s plot Factor 1 values and we color cells by lineage assignment. Clearly, this factors captures the emergence of ExE endoderm.
plot_factor(model,
factor = 1,
color_by = "lineage" # lineage is a column in model@samples.metadata
) + scale_color_manual(values=colors)
Here are other ways of representing the same plot:
p <- plot_factor(model,
factor = 1,
color_by = "lineage",
dot_size = 0.2, # change dot size
dodge = T, # dodge points with different colors
legend = F, # remove legend
add_violin = T, # add violin plots,
violin_alpha = 0.25 # transparency of violin plots
)
p <- p +
scale_color_manual(values=colors) +
scale_fill_manual(values=colors)
p
One can also change the default groups by some manually defined grouping structure. For example a column in the sample metadata
plot_factor(model,
factor = 1,
color_by = "lineage",
group_by = "stage", # cells are now grouped by stage, rather than stage+embryo,
legend = F
) + scale_color_manual(values=colors)
Combinations of factors can be plotted with plot_factors:
plot_factors(model,
factors = c(1,4),
color_by = "lineage"
) + scale_color_manual(values=colors)
The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa. \ Following the cell cycle example from above, we expect genes that are upregulated in the M phase to have large positive weights, whereas genes that are downregulated in the M phase (or, equivalently, upregulated in the G1 phase) are expected to have large negative weights.\
Let’s plot the distribution of weights for Factor 1.
plot_weights(model,
view = "RNA",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
If we are not interested in the directionality of the effect, we can take the absolute value of the weights (abs=TRUE). We can also highlight some genes of interest using the argument manual to see where in the distribution they lie:
plot_weights(model,
view = "RNA",
factor = 1,
nfeatures = 5,
manual = list(c("Snai1","Mesp1","Phlda2"), c("Rhox5","Elf5")),
color_manual = c("darkgreen","red"),
scale = T,
abs = T
)
If you are not interested in the full distribution, but just on the top weights, you can do:
plot_top_weights(model,
view = "RNA",
factor = 1,
nfeatures = 10,
scale = T,
abs = T
)
We expect negative weights For Factor 1 to be marker genes of ExE Endoderm. If we plot Factor 1 colouring cells by gene expresion of the top genes:
genes <- c("Ttr","Apom","Apoa1","Amn")
for (i in genes) {
p <- plot_factor(model,
factor = 1,
color_by = i
) + scale_colour_gradientn(colours = terrain.colors(10)) # change color scale
print(p)
}
The weights are useful to get an idea of which are top genes that drive the factors. However, to get an idea of how well Factors are associated with genomic features we can generate a scatter plot of the Factor values against gene expression for the genes with the highest weights:
Positive weights
p <- plot_data_scatter(model,
view = "RNA",
factor = 1,
features = 6, # Number of features to show
sign = "positive", # select top 6 features with positive weights
color_by = "lineage", # color cells by lineage
add_lm = TRUE # add linear regression estimates
)
p <- p + scale_color_manual(values=colors) + theme(legend.position = "none")
print(p)
Negative weights
p <- plot_data_scatter(model,
view = "RNA",
factor = 1,
features = 6,
sign = "negative",
color_by = "lineage",
add_lm = TRUE
)
p <- p + scale_color_manual(values=colors) + theme(legend.position = "none")
print(p)
The latent space inferred by MOFA can be employed as input to other single-cell algorithms that learn non-linear manifolds such as UMAP or t-SNE. This can be very useful to identify cellular populations and reconstruct complex pseudotime trajectories.
Run t-SNE (takes ~2min):
set.seed(42)
model <- run_tsne(model)
In this data set, we see that the combination of MOFA factors have enough information to discriminate all cell types.
plot_dimred(model,
method = "TSNE",
color_by = "lineage"
) + scale_color_manual(values=colors)
This is not surprising and yields a similar result as running PCA (after batch correction). Let’s move to more exciting applications in the next vignettes!
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_0.99.1 ggplot2_3.2.1 Seurat_3.1.1 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 Rtsne_0.15 colorspace_1.4-1
## [4] ggsignif_0.6.0 ellipsis_0.3.0 ggridges_0.5.1
## [7] ggpubr_0.2.3 leiden_0.3.1 listenv_0.7.0
## [10] npsurv_0.4-0 ggrepel_0.8.1 codetools_0.2-16
## [13] splines_3.6.2 R.methodsS3_1.7.1 doParallel_1.0.15
## [16] lsei_1.2-0 knitr_1.25 zeallot_0.1.0
## [19] jsonlite_1.6 ica_1.0-2 cluster_2.1.0
## [22] png_0.1-7 R.oo_1.22.0 pheatmap_1.0.12
## [25] uwot_0.1.4 HDF5Array_1.12.3 sctransform_0.2.0
## [28] BiocManager_1.30.9 compiler_3.6.2 httr_1.4.1
## [31] backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18
## [34] lazyeval_0.2.2 htmltools_0.4.0 tools_3.6.2
## [37] rsvd_1.0.2 igraph_1.2.4.1 gtable_0.3.0
## [40] glue_1.3.1 RANN_2.6.1 reshape2_1.4.3
## [43] dplyr_0.8.3 Rcpp_1.0.2 vctrs_0.2.0
## [46] gdata_2.18.0 ape_5.3 nlme_3.1-142
## [49] iterators_1.0.12 gbRd_0.4-11 lmtest_0.9-37
## [52] xfun_0.10 stringr_1.4.0 globals_0.12.4
## [55] lifecycle_0.1.0 irlba_2.3.3 gtools_3.8.1
## [58] future_1.14.0 MASS_7.3-51.4 zoo_1.8-6
## [61] scales_1.0.0 parallel_3.6.2 rhdf5_2.28.1
## [64] RColorBrewer_1.1-2 yaml_2.2.0 reticulate_1.13
## [67] pbapply_1.4-2 gridExtra_2.3 stringi_1.4.3
## [70] S4Vectors_0.22.1 corrplot_0.84 foreach_1.4.7
## [73] caTools_1.17.1.2 BiocGenerics_0.30.0 BiocParallel_1.18.1
## [76] bibtex_0.4.2 Rdpack_0.11-0 SDMTools_1.1-221.1
## [79] rlang_0.4.1 pkgconfig_2.0.3 bitops_1.0-6
## [82] matrixStats_0.55.0 evaluate_0.14 lattice_0.20-38
## [85] Rhdf5lib_1.6.3 ROCR_1.0-7 purrr_0.3.3
## [88] labeling_0.3 htmlwidgets_1.5.1 cowplot_1.0.0
## [91] tidyselect_0.2.5 RcppAnnoy_0.0.13 plyr_1.8.4
## [94] magrittr_1.5 bookdown_0.14 R6_2.4.0
## [97] IRanges_2.18.3 gplots_3.0.1.1 DelayedArray_0.10.0
## [100] pillar_1.4.2 withr_2.1.2 fitdistrplus_1.0-14
## [103] survival_3.1-8 tibble_2.1.3 future.apply_1.3.0
## [106] tsne_0.1-3 crayon_1.3.4 KernSmooth_2.23-16
## [109] plotly_4.9.0 rmarkdown_1.16 grid_3.6.2
## [112] data.table_1.12.6 forcats_0.4.0 metap_1.1
## [115] digest_0.6.22 tidyr_1.0.0 R.utils_2.9.0
## [118] RcppParallel_4.4.4 stats4_3.6.2 munsell_0.5.0
## [121] beeswarm_0.2.3 viridisLite_0.3.0 vipor_0.4.5